Note: In this chapter we learn about sampling distributions.
- First, we will work through a simple activity-based example.
- Sections named “DETOUR #” introduce some brand-name distributions.
Let’s begin…
Take a look at the bowl in the following Figure. It has a certain number of red and a certain number of white balls all of equal size. Furthermore, it appears the bowl has been mixed beforehand as there does not seem to be any particular pattern to the spatial distribution of red and white balls.
Let’s now ask ourselves, what proportion of this bowl’s balls are red?
One way to answer this question would be to perform an exhaustive count: remove each ball individually, count the number of red balls and the number of white balls, and divide the number of red balls by the total number of balls. However, this would be a long and tedious process.
Instead, suppose we insert a shovel into the bowl and remove a sample of balls. Observe that ____ of the balls are red and there are a total of ____ balls, and thus ___% of the shovel’s balls are red. We can view the proportion of balls that are red in this shovel as a guess of the proportion of balls that are red in the entire bowl. While not as exact as doing an exhaustive count, our guess of ___% took much less time and energy to obtain.
However, suppose we started this activity over from the beginning. In other words, we return the 50 balls to the bowl and start over. Would we remove exactly 17 red balls again? In other words, would our guess at the proportion of the bowl’s balls that are red be exactly 34% again? Maybe?
What if we repeated this exercise several times? Would we obtain exactly 17 red balls each time? In other words, would our guess at the proportion of the bowl’s balls that are red be exactly 34% every time? Surely not.
Let’s try doing this on the computer…
To this end, we use a data frame bowl in the moderndive package whose rows correspond exactly with the contents of the actual bowl.
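library(moderndive) # provides the bowl data frame and rep_sample_n()
library(dplyr)      # provides %>%, group_by(), summarize(), mutate()
library(ggplot2)    # provides ggplot() and geom_histogram()
head(bowl)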
# A tibble: 6 x 2
ball_ID color
<int> <chr>
1 1 white
2 2 white
3 3 white
4 4 red
5 5 white
6 6 white
# View(bowl) # Use this in the console
Observe in the output that bowl has ___ rows, telling us that the bowl contains ___ equally-sized balls. The first variable ball_ID is used merely as an “identification variable”; none of the balls in the actual bowl are marked with numbers. The second variable color indicates whether a particular virtual ball is red or white.
Now that we have a virtual analogue of our bowl, we need a virtual analogue of the shovel seen in Figure 2; we’ll use this virtual shovel to generate our virtual random samples of 50 balls. We’re going to use the rep_sample_n() function included in the moderndive package. This function allows us to take repeated, or replicated, samples of size n. Run the following and explore.
virtual_shovel <- bowl %>%
  rep_sample_n(size = 50)
virtual_shovel
# A tibble: 50 x 3
# Groups: replicate [1]
replicate ball_ID color
<int> <int> <chr>
1 1 1369 red
2 1 1759 white
3 1 999 red
4 1 667 white
5 1 129 red
6 1 796 white
7 1 572 white
8 1 1980 white
9 1 1533 red
10 1 772 red
# … with 40 more rows
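One way to think about a single shovelful is as a draw of rows without replacement. A minimal base R sketch of that idea follows. It uses a made-up toy_bowl (1000 balls, 400 of them red); these counts are an assumption for illustration only and are not the contents of the real bowl.

```r
# Hypothetical stand-in for the bowl: 1000 balls, 400 red (illustrative counts only)
toy_bowl <- data.frame(
  ball_ID = 1:1000,
  color   = rep(c("red", "white"), times = c(400, 600))
)

set.seed(76)  # arbitrary seed so the draw is reproducible

# One "shovelful": 50 distinct row indices, i.e. sampling without replacement
shovel_rows <- sample(nrow(toy_bowl), size = 50, replace = FALSE)
toy_shovel  <- toy_bowl[shovel_rows, ]

nrow(toy_shovel)                        # 50 balls in the shovel
anyDuplicated(toy_shovel$ball_ID) == 0  # TRUE: no ball is drawn twice
```

Sampling without replacement matches the physical shovel, which cannot pick up the same ball twice in one scoop.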
Next, we can find out how many red balls there are in our virtual_shovel.
virtual_shovel %>%
  summarize(num_red = sum(color == "red"))
# A tibble: 1 x 2
replicate num_red
<int> <int>
1 1 16
How about the proportion of red balls? We can use the mutate() function to create a new variable, in this case prop_red.
virtual_shovel %>%
  summarize(num_red = sum(color == "red")) %>%
  mutate(prop_red = num_red / 50)
# A tibble: 1 x 3
replicate num_red prop_red
<int> <int> <dbl>
1 1 16 0.32
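Because R treats TRUE as 1 and FALSE as 0, the count and the proportion can both be read off a logical vector with sum() and mean(). A small sketch on a hypothetical vector of ten colors (made-up data, not drawn from the bowl):

```r
# Hypothetical shovel of 10 balls (made-up data for illustration)
colors <- c("red", "white", "white", "red", "white",
            "white", "red", "white", "white", "white")

is_red   <- colors == "red"  # logical vector: TRUE where the ball is red
num_red  <- sum(is_red)      # TRUE counts as 1, so this is the number of red balls
prop_red <- mean(is_red)     # the mean of 0/1 values is the proportion red

num_red   # 3
prop_red  # 0.3
```

Using mean() avoids hard-coding the sample size in the denominator.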
To repeat this sampling 30 times, we set the reps argument of rep_sample_n() to 30.
virtual_samples <- bowl %>%
  rep_sample_n(size = 50, reps = 30)
Observe that while the first 50 rows of replicate are equal to 1, the next 50 rows of replicate are equal to 2. This is telling us that the first 50 rows correspond to the first sample of 50 balls while the next 50 correspond to the second sample of 50 balls. This pattern continues for all reps = 30 replicates and thus virtual_samples has \(30 \times 50 = 1500\) rows.
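The row structure described above can be mimicked in base R. The sketch below builds 30 replicates of 50 balls from a hypothetical toy_bowl (made-up counts, not the real bowl) and checks the row arithmetic; it is an illustration of the layout, not the internals of rep_sample_n().

```r
# Hypothetical stand-in for the bowl (illustrative counts only)
toy_bowl <- data.frame(
  ball_ID = 1:1000,
  color   = rep(c("red", "white"), times = c(400, 600))
)

set.seed(76)  # arbitrary seed for reproducibility
reps <- 30    # number of repeated samples
n    <- 50    # balls per sample

# Draw each sample without replacement and tag it with its replicate number
toy_samples <- do.call(rbind, lapply(seq_len(reps), function(r) {
  rows <- sample(nrow(toy_bowl), size = n)
  cbind(replicate = r, toy_bowl[rows, ])
}))

nrow(toy_samples)             # 30 * 50 = 1500 rows
table(toy_samples$replicate)  # each replicate contributes exactly 50 rows
```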
virtual_prop_red <- virtual_samples %>%
  group_by(replicate) %>%
  summarize(red = sum(color == "red")) %>%
  mutate(prop_red = red / 50)
virtual_prop_red
# A tibble: 30 x 3
replicate red prop_red
<int> <int> <dbl>
1 1 21 0.42
2 2 21 0.42
3 3 21 0.42
4 4 22 0.44
5 5 15 0.3
6 6 15 0.3
7 7 19 0.38
8 8 16 0.32
9 9 14 0.28
10 10 18 0.36
# … with 20 more rows
#kable(virtual_prop_red) # To see all 30 samples
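The group_by() and summarize() steps compute one summary per replicate. For readers who want to see the same idea without dplyr, here is a base R sketch on tiny made-up data (two replicates of four balls each, purely illustrative):

```r
# Tiny made-up data: 2 replicates of 4 balls each (illustration only)
toy <- data.frame(
  replicate = rep(1:2, each = 4),
  color     = c("red", "white", "white", "red",
                "white", "white", "white", "red")
)

# Split the red indicator by replicate and average it within each group
prop_red <- tapply(toy$color == "red", toy$replicate, mean)
prop_red  # replicate 1: 0.5, replicate 2: 0.25
```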
Let’s visualize the distribution of these 30 proportions red, based on 30 virtual samples, using a histogram with binwidth = 0.05.
ggplot(virtual_prop_red, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white", fill = "steelblue") +
  labs(x = "Proportion of 50 balls that were red",
       title = "Distribution of 30 proportions red")
Observe that occasionally we obtained proportions red that are less than ____, while on the other hand we occasionally obtained proportions that are greater than ____. However, the most frequently occurring proportions red out of 50 balls were between ____% and ____% (for ___ out of 30 samples). Why do we have these differences in proportions red? Because of ___________________.
Exercise 1.1 Redo the above activity with 1000 repeated samples and state your conclusions.
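Now suppose that, instead of a single shovel, you could choose among three shovels with 25, 50, and 100 slots. If your goal was still to estimate the proportion of the bowl’s balls that were red, which shovel would you choose? Why? Let’s try to answer these questions.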
# Segment 1: sample size = 25 ------------------------------
# 1.a) Virtually use shovel 1000 times
virtual_samples_25 <- bowl %>%
  rep_sample_n(size = 25, reps = 1000)
# 1.b) Compute resulting 1000 replicates of proportion red
virtual_prop_red_25 <- virtual_samples_25 %>%
  group_by(replicate) %>%
  summarize(red = sum(color == "red")) %>%
  mutate(prop_red = red / 25)
# 1.c) Plot distribution via a histogram
p1 <- ggplot(virtual_prop_red_25, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
  labs(x = "Pro of 25 balls that were red", title = "25")
# Segment 2: sample size = 50 ------------------------------
# 2.a) Virtually use shovel 1000 times
virtual_samples_50 <- bowl %>%
  rep_sample_n(size = 50, reps = 1000)
# 2.b) Compute resulting 1000 replicates of proportion red
virtual_prop_red_50 <- virtual_samples_50 %>%
  group_by(replicate) %>%
  summarize(red = sum(color == "red")) %>%
  mutate(prop_red = red / 50)
# 2.c) Plot distribution via a histogram
p2 <- ggplot(virtual_prop_red_50, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
  labs(x = "Pro of 50 balls that were red", title = "50")
# Segment 3: sample size = 100 ------------------------------
# 3.a) Virtually use shovel with 100 slots 1000 times
virtual_samples_100 <- bowl %>%
  rep_sample_n(size = 100, reps = 1000)
# 3.b) Compute resulting 1000 replicates of proportion red
virtual_prop_red_100 <- virtual_samples_100 %>%
  group_by(replicate) %>%
  summarize(red = sum(color == "red")) %>%
  mutate(prop_red = red / 100)
# 3.c) Plot distribution via a histogram
p3 <- ggplot(virtual_prop_red_100, aes(x = prop_red)) +
  geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
  labs(x = "Pro of 100 balls that were red", title = "100")
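# Show the three histograms side by side
# (plot_grid() is assumed to come from the cowplot package; load it with library(cowplot))
plot_grid(p1, p2, p3, nrow = 1)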
Observe that as the sample size increases, the ______ of the 1000 replicates of the proportion red decreases. In other words, as the sample size increases, there are fewer differences due to sampling variation and the distribution centers more tightly around the same value. Eyeballing the above Figure, things appear to center tightly around roughly ____%.
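The three near-identical segments above differ only in the sample size, so the repetition can be folded into one helper. The sketch below does this in base R on a hypothetical toy bowl (1000 balls, 400 red; made-up counts, not the real bowl). The function name simulate_prop_red is an invented name for illustration.

```r
# Hypothetical stand-in for the bowl's colors (illustrative counts only)
toy_bowl_color <- rep(c("red", "white"), times = c(400, 600))

# Draw `reps` samples of size `n` without replacement and
# return the proportion red in each sample
simulate_prop_red <- function(colors, n, reps) {
  replicate(reps, mean(sample(colors, size = n) == "red"))
}

set.seed(76)  # arbitrary seed for reproducibility
sizes <- c(25, 50, 100)
props <- lapply(sizes, function(n) simulate_prop_red(toy_bowl_color, n, reps = 1000))
names(props) <- paste0("n_", sizes)

# Smallest and largest sample proportion seen for each shovel size
sapply(props, function(x) round(range(x), 2))
```

Each element of props is a vector of 1000 proportions that could be passed to a histogram in the same way as the prop_red columns above.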
# n = 25
virtual_prop_red_25 %>%
  summarize(sd = sd(prop_red))
# A tibble: 1 x 1
sd
<dbl>
1 0.0981
# n = 50
virtual_prop_red_50 %>%
  summarize(sd = sd(prop_red))
# A tibble: 1 x 1
sd
<dbl>
1 0.0674
# n = 100
virtual_prop_red_100 %>%
  summarize(sd = sd(prop_red))
# A tibble: 1 x 1
sd
<dbl>
1 0.0484
| Number of slots in shovel | Standard deviation of proportions red |
|---|---|
| 25 | 0.0981 |
| 50 | 0.0674 |
| 100 | 0.0484 |
As the sample size increases, our numerical measure of spread decreases; there is less variation in our proportions red. In other words, as the sample size increases, our guesses at the true proportion of the bowl’s balls that are red get more consistent and precise.
This was our first attempt at understanding two key concepts relating to sampling for estimation: the effect of sampling variation on our estimates, and the effect of sample size on sampling variation.
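A quick arithmetic check on the three simulated standard deviations printed in the R output above: each doubling of the sample size shrinks the spread by a factor close to \(\sqrt{2} \approx 1.41\), which is the pattern one would expect if the spread is roughly proportional to \(1/\sqrt{n}\). This is a sketch on three simulated values, not a proof.

```r
# Standard deviations taken from the simulation output above
sd_25  <- 0.0981
sd_50  <- 0.0674
sd_100 <- 0.0484

sd_25 / sd_50   # about 1.46
sd_50 / sd_100  # about 1.39
sqrt(2)         # about 1.41
```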
Let’s now introduce some terminology and notation as well as statistical definitions related to sampling.
(Study) Population: A (study) population is a collection of individuals or observations about which we are interested. We mathematically denote the population’s size using upper case \(N\). In our simulations the (study) population was the collection of \(N = 2400\) identically sized red and white balls contained in the bowl.
Population parameter: A population parameter is a numerical summary quantity about the population that is unknown, but you wish you knew. For example, when this quantity is a mean, the population parameter of interest is the population mean which is mathematically denoted with the Greek letter \(\mu\) (pronounced “mu”). In our simulations however since we were interested in the proportion of the bowl’s balls that were red, the population parameter is the population proportion which is mathematically denoted with the letter \(p\).
Census: An exhaustive enumeration or counting of all \(N\) individuals or observations in the population in order to compute the population parameter’s value exactly. In our simulations, this would correspond to manually going over all \(N = 2400\) balls in the bowl and counting the number that are red and computing the population proportion \(p\) of the balls that are red exactly. When the number \(N\) of individuals or observations in our population is large, as was the case with our bowl, a census can be very expensive in terms of time, energy, and money.
Sampling: Sampling is the act of collecting a sample from the population when we don’t have the means to perform a census. We mathematically denote the sample’s size using lower case \(n\), as opposed to upper case \(N\) which denotes the population’s size. Typically the sample size \(n\) is much smaller than the population size \(N\), thereby making sampling a much cheaper procedure than a census. In our simulations, we used shovels with 25, 50, and 100 slots to extract a sample of size \(n = 25\), \(n = 50\), and \(n = 100\) balls.
Point estimate (AKA sample statistic): A summary statistic computed from the sample that estimates the unknown population parameter. In our simulations, recall that the unknown population parameter was the population proportion and that this is mathematically denoted with \(p\). Our point estimate is the sample proportion: the proportion of the shovel’s balls that are red. In other words, it is our guess of the proportion of the bowl’s balls that are red. We mathematically denote the sample proportion using \(\hat{p}\); the “hat” on top of the \(p\) indicates that it is an estimate of the unknown population proportion \(p\).
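As a worked instance of this notation, take the single virtual shovel shown earlier, which contained 16 red balls out of \(n = 50\):

\[
\hat{p} = \frac{\text{number of red balls in the sample}}{n} = \frac{16}{50} = 0.32
\]

This matches the prop_red value of 0.32 in that output; a different sample would generally give a different \(\hat{p}\).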
Representative sampling: A sample is said to be a representative sample if it is representative of the population. In other words, are the sample’s characteristics a good representation of the population’s characteristics? In our simulations, are the samples of \(n\) balls extracted using our shovels representative of the bowl’s \(N = 2400\) balls?
Generalizability: We say a sample is generalizable if any results based on the sample can generalize to the population. In other words, can the value of the point estimate be generalized to estimate the value of the population parameter well? In our simulations, can we generalize the values of the sample proportions red of our shovels to the population proportion red of the bowl? Using mathematical notation, is \(\hat{p}\) a “good guess” of \(p\)?
Bias: In a statistical sense, we say bias occurs if certain individuals or observations in a population have a higher chance of being included in a sample than others. We say a sampling procedure is unbiased if every observation in a population has an equal chance of being sampled. In our simulations, since each ball had the same size and hence an equal chance of being sampled by our shovels, our samples were unbiased.
Random sampling: We say a sampling procedure is random if we sample randomly from the population in an unbiased fashion. In our simulations, this would correspond to sufficiently mixing the bowl before each use of the shovel.
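To make the contrast between biased and unbiased sampling concrete, here is a base R sketch on a hypothetical toy bowl (1000 balls, 400 red; made-up counts). The biased scenario, in which red balls are three times as likely to be picked, is an invented assumption for illustration and does not describe the real shovel.

```r
# Hypothetical stand-in for the bowl's colors (illustrative counts only)
toy_color <- rep(c("red", "white"), times = c(400, 600))
true_p    <- mean(toy_color == "red")  # 0.4 by construction

set.seed(76)  # arbitrary seed for reproducibility

# Unbiased: every ball has the same chance of being drawn
p_hat_fair <- replicate(1000, mean(sample(toy_color, size = 50) == "red"))

# Biased (assumed scenario): red balls are 3 times as likely to be picked
weights      <- ifelse(toy_color == "red", 3, 1)
p_hat_biased <- replicate(
  1000,
  mean(sample(toy_color, size = 50, prob = weights) == "red")
)

mean(p_hat_fair)    # close to true_p
mean(p_hat_biased)  # noticeably above true_p
```

Under equal-chance sampling the sample proportions scatter around the population proportion, while the weighted draws overshoot it no matter how many times the sampling is repeated.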
Let’s put them all together:
If we extract a sample of \(n=50\) balls at random, in other words we mix the equally-sized balls before using the shovel, then
the contents of the shovel are an unbiased representation of the contents of the bowl’s 2400 balls, thus
any result based on the sample of balls can generalize to the bowl, thus
the sample proportion \(\hat{p}\) of the \(n=50\) balls in the shovel that are red is a “good guess” of the population proportion \(p\) of the \(N =2400\) balls that are red, thus
instead of manually going over all the balls in the bowl, we can infer about the bowl using the shovel.